The Accidental Taxonomist by Heather Hedden

Author: Heather Hedden [HEDDEN, HEATHER]
Language: eng
Format: epub
ISBN: 9781573879187
Publisher: Information Today, Inc.


Rules-Based Auto-Categorization

Rules-based auto-categorization puts more emphasis on matching text patterns than on statistical analysis. Software that is entirely rules-based is not quite as common as machine learning-based auto-categorization, simply because greater human intervention is required, and the customers for auto-categorization tools tend to seek the most “automated” method. However, tools that combine rules-based and machine learning-based auto-categorization are becoming more common. For rules-based auto-categorization, humans, often taxonomists, write rules for taxonomy terms. These rules are conditional statements that often involve Boolean logic and may use operators that look at word order, word proximity, and content structure to identify patterns in the text and apply taxonomy terms accordingly. The collection of rules is sometimes called a rule set or a knowledge base. Auto-categorization systems may automatically generate basic rules for every term, and the taxonomist then only needs to write additional conditional statements for the terms that are more ambiguous.
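The idea of a rule set with automatically generated basic rules and hand-written refinements can be sketched as follows. This is a minimal illustration, not the design of any particular product: the names basic_rule, build_rule_set, categorize, and rice_rule are hypothetical, and the sample taxonomy terms and context words are assumptions chosen only for the example.

```python
import re
from typing import Callable, Dict, List

# A rule is a conditional test applied to a text (hypothetical type alias).
Rule = Callable[[str], bool]

def basic_rule(term: str) -> Rule:
    """Auto-generated default: the term appears as a whole word or phrase."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return lambda text: pattern.search(text) is not None

def build_rule_set(terms: List[str], overrides: Dict[str, Rule]) -> Dict[str, Rule]:
    """Generate a basic rule for every term, then apply hand-written rules."""
    rule_set = {term: basic_rule(term) for term in terms}
    rule_set.update(overrides)  # the taxonomist's rules replace the defaults
    return rule_set

def categorize(text: str, rule_set: Dict[str, Rule]) -> List[str]:
    """Return the taxonomy terms whose rules are satisfied by the text."""
    return sorted(term for term, rule in rule_set.items() if rule(text))

def rice_rule(text: str) -> bool:
    """Assumed refinement: 'rice' counts only alongside an agricultural word."""
    context = ("grain", "crop", "harvest")
    return basic_rule("rice")(text) and any(basic_rule(w)(text) for w in context)

rule_set = build_rule_set(["earthquakes", "rice"], {"rice": rice_rule})
```

With this sketch, only the ambiguous term needs a hand-written rule; every other term keeps its generated default.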

Sometimes a word found in the text matches a term in the taxonomy as a true synonym, in which case no conditional rules are required. However, when a term in the taxonomy has multiple meanings, rules are needed for clarification. Nonpreferred terms also require qualification through rules in order to serve as unambiguous matches. For example, the term earthquakes has a number of possible nonpreferred terms, such as quake, tremors, trembler, and aftershocks, but each of these also has other meanings. They can still be used as nonpreferred terms, provided there are rules restricting them to texts that also include certain other words or phrases, such as Richter scale, disaster, or structural damage.

Both topical concepts and named entities may utilize rules. Rules are especially useful for distinguishing individuals with common surnames (Smith, Brown, Johnson); names that could apply to individuals, organizations, or places (Washington, Columbus, Madison, Jackson); or names that are also common words (Bush, Rice, Gates).
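The earthquakes example can be expressed as a small conditional rule. The sketch below assumes a simple whole-word matcher; the function names contains_phrase and matches_earthquakes are hypothetical, and the word lists are taken from the example in the text rather than from any real rule base.

```python
import re

# Words that match the term "earthquakes" without further conditions.
UNAMBIGUOUS = ["earthquake", "earthquakes"]
# Nonpreferred terms that also have other meanings.
AMBIGUOUS = ["quake", "tremors", "trembler", "aftershocks"]
# Context words or phrases that must also appear for an ambiguous match.
CONTEXT = ["Richter scale", "disaster", "structural damage"]

def contains_phrase(text: str, phrase: str) -> bool:
    """Whole-word, case-insensitive match of a word or phrase."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

def matches_earthquakes(text: str) -> bool:
    """Apply the term if it is named directly, or if an ambiguous
    nonpreferred term appears together with a context phrase."""
    if any(contains_phrase(text, t) for t in UNAMBIGUOUS):
        return True
    has_ambiguous = any(contains_phrase(text, t) for t in AMBIGUOUS)
    has_context = any(contains_phrase(text, c) for c in CONTEXT)
    return has_ambiguous and has_context
```

Under this rule, a medical text that mentions tremors but none of the context phrases would not be categorized under earthquakes.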

Rules can refer to many different conditions regarding the word or phrase of text to match. These include:

• Truncation of the word

• The mention of other words in the text

• Proximity to other words in the text

• Relation to other words in the text, based on Boolean AND, OR, and NOT operators

• Initial capitalization or full-word capitalization

• Text placement within a sentence (a rule usually used in combination with capitalization)
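Several of the conditions listed above can be sketched as small helper functions and then combined with Boolean operators. The helper names, the five-word proximity window, and the sample rule for the surname Bush are all assumptions made for illustration; commercial tools provide their own rule syntax and operators.

```python
import re
from typing import List

def tokens(text: str) -> List[str]:
    """Split text into alphabetic word tokens."""
    return re.findall(r"[A-Za-z']+", text)

def has_truncated(text: str, stem: str) -> bool:
    """Truncation: any word beginning with the stem (for example, categor*)."""
    pattern = r"\b" + re.escape(stem) + r"\w*"
    return re.search(pattern, text, re.IGNORECASE) is not None

def within(text: str, word_a: str, word_b: str, max_gap: int) -> bool:
    """Proximity: the two words occur within max_gap tokens of each other."""
    toks = [t.lower() for t in tokens(text)]
    pos_a = [i for i, t in enumerate(toks) if t == word_a.lower()]
    pos_b = [i for i, t in enumerate(toks) if t == word_b.lower()]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)

def capitalized_mid_sentence(text: str, word: str) -> bool:
    """Capitalization plus placement: the capitalized word appears
    somewhere other than the first position of a sentence."""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if word in tokens(sentence)[1:]:
            return True
    return False

def looks_like_surname_bush(text: str) -> bool:
    """Boolean combination (hypothetical rule): capitalized mid-sentence
    OR near 'president', AND NOT accompanied by a garden* word."""
    near_title = within(text, "Bush", "president", 5)
    gardening = has_truncated(text, "garden")
    return (capitalized_mid_sentence(text, "Bush") or near_title) and not gardening
```

The placement check matters because a word that begins a sentence is capitalized regardless of whether it is a name, so capitalization alone is only informative elsewhere in the sentence.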


